Abstract. A Bloom filter is a space-efficient structure for storing static sets, where the space efficiency 
is gained at the expense of a small probability of false-positives. A Bloomier filter generalizes a Bloom 
filter to compactly store a function with a static support. In this article we give a simple construction 
of a Bloomier filter. The construction is linear in space and requires constant time to evaluate. The 
creation of our Bloomier filter takes linear time which is faster than the existing construction. We show 
how one can improve the space utilization further at the cost of increasing the time for creating the 
data structure. 






1 Introduction 

A Bloom filter is a compact data structure that supports set membership queries [2]. Given a set $S \subseteq D$, where $D$ is a large set and $|S| = n$, the Bloom filter requires space $O(n)$ and has the following properties. It can answer membership queries in $O(1)$ time. However, it has one-sided error: given $x \in S$, the Bloom filter will always declare that $x$ belongs to $S$, but given $x \in D \setminus S$ the Bloom filter will, with high probability, declare that $x \notin S$. Bloom filters have found wide-ranging applications [4, 5, 14, 16, 19, 20]. There have also been generalizations in several directions of the Bloom filter [8, 13, 21, 23]. More recently, Bloom filters have been generalized to "Bloomier" filters that compactly store functions [7]. In more detail: given $S \subseteq D$ and a function $f : S \to \{0,1\}^k$, a Bloomier filter is a data structure that supports queries to the function value. It also has one-sided error: given $x \in S$, it always outputs the correct value $f(x)$, and if $x \in D \setminus S$, with high probability it outputs '$\perp$', a symbol not in the range of $f$. In [7] the authors construct a Bloomier filter that requires $O(n \log n)$ time to create, $O(n)$ space to store, and $O(1)$ time to evaluate. 

In this paper we give an alternate construction of Bloomier filters, which we believe is simpler than that 
of [7]. It has similar space and query time complexity. The creation is slightly faster, $O(n)$ vs. $O(n \log n)$. 
Changing the value of $f(x)$ while keeping $S$ the same is slower in the worst case for our method, $O(\log n)$ 
vs. $O(1)$. For a detailed comparison we direct the reader to §6. In §3 we discuss another construction that 
is very natural and has a smaller space requirement. However, this algorithm has a creation time of $O(n^3)$, 
which is too expensive. In §4 we discuss how bucketing can be used to reduce the construction time of this 
algorithm to $n \log^{O(1)} n$ and make it more practical. In §7 we discuss some experimental results comparing 
the existing construction to ours for storing the in-degree information for a billion URLs. 

2 The construction 

2.1 A 1-bit Bloomier Filter 

We begin with the following simplified problem: given a set $S$ of $n$ elements and a function $f : S \to \{0,1\}$, encode $f$ into a space-efficient data structure that allows fast access to the values of $f$. A simple way to solve this problem is to use a hash table (with open addressing), which requires $O(n)$ space and $O(1)$ time on average to evaluate $f$. If we want worst-case $O(1)$ time for function evaluation, we could try different hash functions until we find one which produces few hash collisions on the set $S$. This solution, however, does not generalize to our ultimate goal, which is to have a compact encoding of the function $\hat f : D \to \{0,1,\perp\}$, where $\hat f|_S = f$ and $\hat f(x) = \perp$ with high probability if $x \notin S$. Thus if $D$ is much larger than $S$, the solution using hash tables is not very attractive as it uses space proportional to $D$. To counteract this one could use the hash table in conjunction with a Bloom filter for $S$. This is not the approach we will take (see footnote 1). 

Our approach to solving the simplified problem uses ideas from the creation of minimal perfect hashes (see [9]). We first map $S$ onto the edges of a random (undirected) graph $G(V,E)$ constructed as follows. Let $V$ be a set of vertices with $|V| \ge c|S|$, where $c > 1$ is a constant. Let $h_1, h_2 : D \to V$ be two hash functions. For each $x \in S$, we create an edge $e = (h_1(x), h_2(x))$ and let $E$ be the set of edges formed in this way (so that $|E| = |S| = n$). If the graph $G$ is not acyclic we try again with two independent hash functions $h_1', h_2'$. It is known that if $c > 2$, then the expected number of vertices on tree components is $|V| + O(1)$ ([3] Theorem 5.7 ii). Indeed, in [10] the authors proved that if $G(V,E)$ is a random graph with $|V| = c|E|$ and $c > 2$, then with probability $e^{1/c}\sqrt{(c-2)/c}$ the graph is acyclic. Thus, if $c > 2$ is fixed then the expected number of iterations until we find an acyclic graph is $O(1)$. In particular, if $c \ge 2.09$ then with probability at least $1/3$ the graph $G$ is acyclic, so the expected number of times we will have to regenerate the graph until we find an acyclic graph is at most 3. Once we have an acyclic graph $G$, we try to find a function $g : V \to \{0,1\}$ such that $f(x) \equiv g(h_1(x)) + g(h_2(x)) \pmod 2$ for each $x \in S$. One can view this as a sequence of $n$ equations for the variables $g(v)$, $v \in V$. The fact that $G$ is acyclic implies that the set of equations can be solved by simple back-substitution in linear time. We then store the table of values $g(v) \in \{0,1\}$ for each $v \in V$. To evaluate the function $f$, given $x$, we compute $h_1(x)$ and $h_2(x)$ and add up the values stored in the table $g$ at these two indices modulo 2. The expected creation time is $O(n)$, evaluation time is $O(1)$ (two hash function computations and two memory lookups to the table of values $g$) and the space utilization is $\lceil cn \rceil$ bits. 
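
As a quick numerical check of the acyclicity probability $e^{1/c}\sqrt{(c-2)/c}$ quoted from [10], the following minimal sketch evaluates the formula for a few values of $c$. The function name `acyclic_probability` is chosen here for illustration only.

```python
import math


def acyclic_probability(c: float) -> float:
    """Limiting probability e^(1/c) * sqrt((c - 2) / c) that the random graph is acyclic."""
    if c <= 2:
        raise ValueError("the formula applies only for c > 2")
    return math.exp(1.0 / c) * math.sqrt((c - 2.0) / c)


for c in (2.09, 2.5, 3.0):
    prob = acyclic_probability(c)
    # The number of attempts is geometric, so its expectation is 1 / prob.
    print(f"c = {c}: P(acyclic) ~ {prob:.4f}, expected attempts ~ {1.0 / prob:.2f}")
```

Larger $c$ means fewer retries but a proportionally larger table, so the choice of $c$ trades construction time against space.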

Next, we generalize this approach to encoding the function $\hat f : D \to \{0,1,\perp\}$ that, when restricted to $S$, agrees with $f$, and outside of $S$ maps to $\perp$ with high probability. Here again we will use the same construction of the random acyclic graph $G(V,E)$ together with a map from $S \to E$ via two hash functions $h_1, h_2$. Let $m > 2$ be an integer and $h_3 : D \to \mathbb{Z}/m\mathbb{Z}$ be another independent hash function. We solve for a function $g : V \to \mathbb{Z}/m\mathbb{Z}$ such that the equation $f(x) \equiv g(h_1(x)) + g(h_2(x)) + h_3(x) \pmod m$ holds for each $x \in S$. Again, since the graph $G$ is acyclic these equations can be solved using back-substitution. Note that back-substitution works even though we are dealing with the ring $\mathbb{Z}/m\mathbb{Z}$, which is not a field unless $m$ is prime: each step only adds and subtracts known values and never divides. To evaluate the function $\hat f$ at $x$ we compute $h_i(x)$ for $1 \le i \le 3$ and then compute $g(h_1(x)) + g(h_2(x)) + h_3(x) \pmod m$. If the computed value is either 0 or 1 we output it; otherwise, we output the symbol $\perp$. Algorithms 1 and 2 give the steps of the construction in more detail. It is clear that if $x \in S$ then the value output by our algorithm is the correct value $f(x)$. If $x \notin S$ then the value of $h_3(x)$ is independent of the values of $g(h_1(x))$ and $g(h_2(x))$ and uniform in the range $\mathbb{Z}/m\mathbb{Z}$. Thus $\Pr_{x \in D \setminus S}[g(h_1(x)) + g(h_2(x)) + h_3(x) \in \{0,1\}] = \frac{2}{m}$. 
In summary, we have proved the following: 

Proposition 1. Fix $c > 2$ and let $m > 2$ be an integer. The algorithms described above (Algorithms 1 
and 2) implement a Bloomier filter for storing the function $\hat f : D \to \{0,1,\perp\}$ and the underlying function 
$f : S \to \{0,1\}$ with the following properties: 

1. The expected time for creation of the Bloomier filter is $O(n)$. 

2. The space used is $\lceil cn \rceil \lceil \log_2 m \rceil$ bits, where $n = |S|$. 

3. Computing the value of the Bloomier filter at $x \in D$ requires $O(1)$ time (3 hash function computations 
and 2 memory lookups). 

4. Given $x \in S$, it outputs the correct value of $f(x)$. 

5. Given $x \notin S$, it outputs $\perp$ with probability $1 - \frac{2}{m}$. 

2.2 General k-bit Bloomier Filters 

It is easy to generalize the results of the previous section to obtain Bloomier filters with range larger than just 
the set $\{0,1\}$. Given a function $f : S \to \{0,1\}^k$ it is clear that as long as the range $\{0,1\}^k$ embeds into the ring 



1 The reason this is not optimal is that to achieve error probability $\epsilon$, we will need to evaluate $O(\log 1/\epsilon)$ hash 
functions. 



Algorithm 1 Generate Table 

Input: A set $S \subseteq D$ and a function $f : S \to \{0,1\}$, $c > 2$, and an integer $m > 2$. 
Output: Table $g$ and hash functions $h_1, h_2, h_3$ such that $\forall s \in S : g[h_1(s)] + g[h_2(s)] + h_3(s) \equiv f(s) \bmod m$. 
Let $V = \{0, 1, \ldots, \lceil cn \rceil - 1\}$, where $n = |S|$. 
repeat 
  Generate $h_1, h_2 : D \to V$ where the $h_i$ are chosen independently from $\mathcal{H}$, a family of hash functions; let $E = \{(h_1(s), h_2(s)) : s \in S\}$. 
until $G(V,E)$ is a simple acyclic graph. 
Let $h_3 : D \to \mathbb{Z}/m\mathbb{Z}$ be a third independently chosen hash function from $\mathcal{H}$. 
for all $T$, a connected component of $G(V,E)$, do 
  Choose a vertex $v \in T$ whose degree is non-zero. 
  $F \leftarrow \{v\}$; $g[v] \leftarrow 0$. 
  while $F \ne T$ do 
    Let $C$ be the set of nodes in $T \setminus F$ adjacent to nodes in $F$. 
    for all $w = h_i(s) \in C$ do 
      $g[w] \leftarrow f(s) - g[h_{3-i}(s)] - h_3(s) \bmod m$. 
    end for 
    $F \leftarrow F \cup C$. 
  end while 
end for 



$\mathbb{Z}/m\mathbb{Z}$ one can still use Algorithm 1 without any changes. This translates into the simple requirement that we 
take $m > 2^k$. Algorithm 2 needs a minor modification, namely, we check if $f = g(h_1(x)) + g(h_2(x)) + h_3(x) \pmod m$ 
lies in $\{0,1\}^k$ and if so we output $f$; otherwise, we output $\perp$. We encapsulate the claims about the 
generalization in the following theorem (the proof of which is similar to that of Proposition 1): 

Theorem 1. Fix $c > 2$ and let $m > 2^k$ be an integer. The algorithms described above implement a Bloomier 
filter for storing the function $\hat f : D \to \{0,1\}^k \cup \{\perp\}$, and the underlying function $f : S \to \{0,1\}^k$, with the 
following properties: 

1. The expected time for creation of the Bloomier filter is $O(n)$. 

2. The space used is $\lceil cn \rceil \lceil \log_2 m \rceil$ bits, where $n = |S|$. 

3. Computing the value of the Bloomier filter at $x \in D$ requires $O(1)$ time (3 hash function computations 
and 2 memory lookups). 

4. Given $x \in S$, it outputs the correct value of $f(x)$. 

5. Given $x \notin S$, it outputs $\perp$ with probability $1 - \frac{2^k}{m}$. 



Algorithm 2 Query function 

Input: Table $g$, hash functions $h_1, h_2 : D \to \{0, \ldots, \lceil cn \rceil - 1\}$, $h_3 : D \to \mathbb{Z}/m\mathbb{Z}$, and $x \in D$. 
Output: 0, 1 or $\perp$, the output of the Bloomier filter represented by the table $g$. 
$f \leftarrow g[h_1(x)] + g[h_2(x)] + h_3(x) \bmod m$. 
if $f \in \{0,1\}$ then 
  Output $f$. 
else 
  Output $\perp$. 
end if 
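
The two algorithms can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the salted BLAKE2 hash stands in for the family $\mathcal{H}$, keys are strings, function values are integers in $[0, 2^k)$, and the names `make_hash`, `solve`, `build`, `query` and `BOTTOM` are hypothetical.

```python
import hashlib
import math
import random

BOTTOM = None  # stands for the symbol "bottom" (not in the range of f)


def make_hash(seed, modulus):
    """Salted hash D -> {0, ..., modulus-1}; an illustrative stand-in for the family H."""
    salt = seed.to_bytes(8, "big")

    def h(x):
        digest = hashlib.blake2b(x.encode(), salt=salt, digest_size=8).digest()
        return int.from_bytes(digest, "big") % modulus

    return h


def solve(adj, f, h3, size, m):
    """Back-substitution over each tree component; returns None if a cycle is found."""
    g = [0] * size
    visited = set()
    for root in adj:
        if root in visited:
            continue
        visited.add(root)            # g[root] stays 0
        stack = [(root, None)]
        while stack:
            u, via = stack.pop()
            for w, s in adj[u]:
                if s == via:         # do not walk back along the edge we arrived by
                    continue
                if w in visited:     # a second path to w means the graph has a cycle
                    return None
                visited.add(w)
                g[w] = (f[s] - g[u] - h3(s)) % m
                stack.append((w, s))
    return g


def build(f, k=1, c=2.1, m=256, rng=None):
    """Sketch of Algorithm 1: f maps each key of S to an integer in [0, 2**k)."""
    rng = rng or random.Random(0)
    assert c > 2 and m > 2 ** k
    size = math.ceil(c * len(f))
    while True:
        h1 = make_hash(rng.getrandbits(64), size)
        h2 = make_hash(rng.getrandbits(64), size)
        adj, edges = {}, set()
        for s in f:
            u, v = h1(s), h2(s)
            edge = (min(u, v), max(u, v))
            if u == v or edge in edges:   # self-loop or repeated edge: not simple
                break
            edges.add(edge)
            adj.setdefault(u, []).append((v, s))
            adj.setdefault(v, []).append((u, s))
        else:
            h3 = make_hash(rng.getrandbits(64), m)
            g = solve(adj, f, h3, size, m)
            if g is not None:
                return g, h1, h2, h3, m, k


def query(table, x):
    """Sketch of Algorithm 2: f(x) for x in S, BOTTOM for most x outside S."""
    g, h1, h2, h3, m, k = table
    y = (g[h1(x)] + g[h2(x)] + h3(x)) % m
    return y if y < 2 ** k else BOTTOM


f = {f"url{i}": i % 2 for i in range(1000)}
table = build(f)
print(all(query(table, x) == v for x, v in f.items()))
misses = sum(query(table, f"other{i}") is BOTTOM for i in range(1000))
print(misses, "of 1000 non-members mapped to BOTTOM")
```

With $m = 256$ and $k = 1$, roughly a fraction $2/256$ of non-members is expected to receive a spurious value; raising $m$ lowers that rate at the cost of $\lceil \log_2 m \rceil$ bits per table entry.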



2.3 Mutable Bloomier filters 

In this section we consider the task of handling changes to the function stored in the Bloomier filter produced by the algorithms in the previous section. We will only consider changes to the function $f : S \to \{0,1\}^k$ where $S$ remains the same and only the values taken by the function change. In other words, the support of the function remains static. 

Consider what happens when $f : S \to \{0,1\}^k$ is changed to the function $f' : S \to \{0,1\}^k$ where $f(x) = f'(x)$ except for a single $y \in S$. In this case we can change the values stored in the $g$-table so that we output the value of $f'$ at $y$. We assume that the edges of the graph $G$ are available (this is an additional $O(n \log n)$ bits). We begin with the observation that the values stored at $g(v)$ for vertices $v$ not in the connected component containing the edge $e = (h_1(y), h_2(y))$ remain unaffected. Thus changing $f$ to $f'$ affects only the $g$ values of the connected component, $C$ (say), containing the edge $e$. Recomputing the $g$ values corresponding to $C$ would take time $O(|C|)$. How big can the largest connected component in $G$ get? Our graph $G(V,E)$ built in Algorithm 1 is a sparse random graph with $|E| < \frac{1}{2}|V|$. A classical result due to Erdős and Rényi says that in this case the largest component is almost surely (see footnote 2) $O(\log n)$ in size, where $n = |E|$ (see [12] or [3]). Thus updates to the Bloomier filter take $O(\log n)$ time provided we ensure that the largest component in $G$ is small when creating it. The result from [12] tells us that adding the extra condition while creating $G$ will not change the expected running time of Algorithm 1. We call this modified algorithm Algorithm 1'. 

Theorem 2. The Bloomier filter constructed using Algorithms 1' and 2 can accommodate changes to function 
values in time $O(\log n)$, provided the graph $G$ is also retained. Moreover, the claims of Theorem 1 remain 
true for Algorithms 1' and 2. 
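
The update step can be illustrated on a tiny hand-built forest. In this sketch the graph, the values of $h_3$ (given as a plain dictionary) and the helper names `resolve_component` and `holds` are all assumptions made for the example; only the component containing the changed edge is recomputed.

```python
def resolve_component(start, adj, f, h3, g, m):
    """Recompute g on the tree component containing `start`; return its vertex count."""
    g[start] = 0
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for w, s in adj.get(u, []):
            if w not in seen:
                seen.add(w)
                g[w] = (f[s] - g[u] - h3[s]) % m
                stack.append(w)
    return len(seen)


def holds(edges, f, h3, g, m):
    """Check g[h1(s)] + g[h2(s)] + h3(s) = f(s) mod m for every stored element s."""
    return all((g[u] + g[v] + h3[s]) % m == f[s] for s, (u, v) in edges.items())


# Element s is stored on edge (h1(s), h2(s)); two components: {0, 1, 2} and {3, 4}.
edges = {"a": (0, 1), "b": (1, 2), "c": (3, 4)}
adj = {}
for s, (u, v) in edges.items():
    adj.setdefault(u, []).append((v, s))
    adj.setdefault(v, []).append((u, s))

m = 8
h3 = {"a": 3, "b": 5, "c": 1}
f = {"a": 1, "b": 0, "c": 1}
g = [0] * 5
resolve_component(0, adj, f, h3, g, m)
resolve_component(3, adj, f, h3, g, m)
print(holds(edges, f, h3, g, m))

before = list(g)
f["b"] = 1                                   # change a single function value
touched = resolve_component(edges["b"][0], adj, f, h3, g, m)
print(touched, holds(edges, f, h3, g, m))
```

The cost of an update is proportional to the size of the touched component, which is why Algorithm 1' insists that every component be small.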

3 Reducing the space utilization 

If we are willing to spend more time in the creation phase of the Bloomier filter, we can further reduce the space utilization of the Bloomier filter. In this section we show how one can get a Bloomier filter for a function $f : S \to \{0,1\}^k$ with error rate $\frac{2^k}{m}$ using only $n(1+\epsilon)\lceil \log_2 m \rceil$ bits of storage, where $n = |S|$ and $\epsilon > 0$ is a constant. In §2 we used a random graph generated by hash functions to systematically generate a set of equations that can be solved efficiently. The solution to these equations is then stored in a table which in turn encodes the function $f$. The main idea to reduce space usage further is to have a table $g[0], g[1], \ldots, g[N-1]$, where $N = (1+\epsilon)n$, and try to solve the following set of equations over $\mathbb{Z}/m\mathbb{Z}$: 

$$\Bigl(\sum_{1 \le i \le s} h_i(x)\, g[h_{i+s}(x)]\Bigr) + h_0(x) = f(x), \qquad x \in S \qquad (1)$$

for the unknowns $g[0], \ldots, g[N-1]$. Here $s > 1$ is a fixed integer and $h_0, h_1, \ldots, h_{2s}$ are independent hash functions. Since $s$ is fixed, lookup of a function value will only take $O(1)$ hash function evaluations. These equations can be solved provided the determinant of the sparse matrix corresponding to these equations is a unit in $\mathbb{Z}/m\mathbb{Z}$. The next subsection gives an answer (under suitable conditions) to this question when $m$ is a prime. 

3.1 Full rank sparse matrices over a finite field 

Let $\mathrm{GL}^s_{n \times r}(\mathbb{F}_p)$ be the set of full rank $n \times r$ matrices over $\mathbb{F}_p$ (see footnote 3) that have exactly $s$ non-zero entries in each column. Our aim in this section is to get a lower bound for $\#\mathrm{GL}^s_{n \times r}(\mathbb{F}_p)$ (the cardinality of this set). We note the following lemma whose proof we omit. 

Lemma 1. Let $M^s_{n \times r}(\mathbb{F}_p)$ be the matrices over $\mathbb{F}_p$ where each column has exactly $s$ non-zero entries. Then $\#M^s_{n \times r}(\mathbb{F}_p) = \bigl(\binom{n}{s}(p-1)^s\bigr)^r$. 



2 This means that the probability that the condition holds is $1 - o(1)$. 

3 Here $p$ is a prime number and $\mathbb{F}_p$ is the finite field with $p$ elements. Any two finite fields with $p$ elements are 
isomorphic and the isomorphism is canonical. If the field has $p^r$, $r > 1$, elements then the isomorphism is not 
canonical. 



Before we begin the task of getting a lower bound for the sparse full rank matrices we briefly recall the method of proof for finding $\#\mathrm{GL}_n(\mathbb{F}_p)$, the order of the group of invertible $n \times n$ matrices over $\mathbb{F}_p$. One can build invertible matrices column by column as follows: choose any non-zero vector for the first column; there are $p^n - 1$ ways of choosing the first column. The second column vector should not lie in the linear span of the first. Therefore there are $p^n - p$ choices for the second column vector. Proceeding in this way there are $p^n - p^j$ choices for the $(j+1)$-st column. Thus we have $\#\mathrm{GL}_n(\mathbb{F}_p) = \prod_{1 \le j \le n} (p^n - p^{n-j})$. 
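
The column-by-column count can be checked by brute force for very small $n$ and $p$. The sketch below enumerates all matrices and compares against the product formula; the helper names are illustrative, and `pow(x, -1, p)` assumes Python 3.8 or later.

```python
from itertools import product


def rank_mod_p(rows, p):
    """Rank of a matrix (given as a sequence of rows) over F_p by Gaussian elimination."""
    rows = [list(r) for r in rows]
    ncols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)             # modular inverse of the pivot
        rows[rank] = [(a * inv) % p for a in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % p:
                factor = rows[i][col]
                rows[i] = [(a - factor * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def count_invertible_brute(n, p):
    """Count invertible n x n matrices over F_p by exhaustive enumeration."""
    count = 0
    for entries in product(range(p), repeat=n * n):
        matrix = [entries[i * n:(i + 1) * n] for i in range(n)]
        if rank_mod_p(matrix, p) == n:
            count += 1
    return count


def count_invertible_formula(n, p):
    """The product of (p^n - p^j) for j = 0, ..., n-1."""
    result = 1
    for j in range(n):
        result *= p ** n - p ** j
    return result


for n, p in ((2, 2), (2, 3), (3, 2)):
    print(n, p, count_invertible_brute(n, p), count_invertible_formula(n, p))
```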

One can adapt this idea to get a bound on the invertible $s$-sparse matrices. There are $\binom{n}{s}(p-1)^s$ ways of choosing the first column. Inductively, suppose we have chosen the first $i$ columns to be linearly independent; then we have a vector space $V_i \subseteq \mathbb{F}_p^n$ of dimension $i$ spanned by the first $i$ columns. One can grow this matrix to a rank $i+1$ matrix by augmenting it by any $s$-sparse vector $w \notin V_i$. Thus we are faced with the task of finding an upper bound on the number of $s$-sparse vectors contained in $V_i$. We introduce some notation: suppose $w = (w_1, w_2, \ldots, w_n)^t \in \mathbb{F}_p^n$ is a vector; then we define $w^{(1)}$ to be the vector $(w_n, w_1, \ldots, w_{n-1})^t$ (a cyclic shift of $w$), and write $w^{(a)}$ for the result of $a$ successive cyclic shifts. Note that if $w$ is $s$-sparse then so is $w^{(1)}$. Our approach is to show that under certain circumstances the vector space spanned by the orbit of a sparse vector under the circular shifts has high dimension and consequently, all the shifts cannot be contained in $V_i$ (unless $i = n$). It is natural to expect that given an $s$-sparse vector $w$, the vector space $W_w$ spanned by all the circular shifts $w, w^{(1)}, \ldots, w^{(n-1)}$ has dimension $> n - s$. Unfortunately, this is not so: for example, consider $w = (1, 0, 1, 0, 1, 0)$ whose cyclic shifts generate a vector space of dimension 2. This motivates the next lemma. 

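Lemma 2. Suppose $q$ is a prime number and $w \in \mathbb{F}_p^q$ is an $s$-sparse vector with $0 < s < q$. Then the orbit $\{w, w^{(1)}, \ldots, w^{(q-1)}\}$ of $w$ under cyclic shifts has cardinality $q$. 

Proof. We have a natural action of the group $\mathbb{Z}/q\mathbb{Z}$ on the set of cyclic shifts of $w$, via $a \mapsto w^{(a)}$, the shift of $w$ by $a$ positions. Suppose we have $w^{(i)} = w^{(j)}$ for $0 \le i \ne j \le q-1$. Since we have a group action this implies that $w^{(i-j)} = w$. Since $q$ is prime, $i - j$ generates $\mathbb{Z}/q\mathbb{Z}$, and this means that $w^{(1)} = w$, i.e., all coordinates of $w$ are equal. But $0 < s < q$, therefore $w^{(1)} \ne w$ and we have a contradiction. □ 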



One can show that the vector space spanned by the cyclic shifts of an $s$-sparse vector ($0 < s < n$) has 
dimension at least $n/s$. However, this bound is not sufficient for our purpose. We need the following stronger 
conditional result whose proof is relegated to the appendix (see Theorems 8 and 9 in the Appendix). 

Theorem 3. Let $w = (w_0, \ldots, w_{q-1}) \in \mathbb{F}_p^q$, where $p$ is a prime that is a primitive root modulo $q$ (i.e., $p$ 
generates the cyclic group $\mathbb{F}_q^{*}$). Suppose $w_0 + w_1 + \cdots + w_{q-1} \ne 0$ and the $w_i$ are not all equal; then $W_w$ (the vector 
space spanned by the cyclic shifts of $w$) has dimension $q$. 
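
The dimension of the span of the cyclic shifts is easy to compute for small cases. The following sketch (helper names are illustrative; `pow(x, -1, p)` assumes Python 3.8 or later) reproduces the dimension-2 example of composite length 6, and contrasts it with two vectors of prime length $q = 5$ over $\mathbb{F}_2$, where 2 is a primitive root modulo 5: one satisfies the hypotheses of Theorem 3 and one has coordinate sum zero.

```python
def rank_mod_p(rows, p):
    """Rank of a matrix (list of rows) over F_p by Gaussian elimination."""
    rows = [[a % p for a in r] for r in rows]
    ncols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)
        rows[rank] = [(a * inv) % p for a in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                factor = rows[i][col]
                rows[i] = [(a - factor * b) % p for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def cyclic_shifts(w):
    """All n cyclic shifts of w; shift k moves the last k entries to the front."""
    w = list(w)
    return [w[-k:] + w[:-k] if k else w[:] for k in range(len(w))]


def shift_span_dimension(w, p):
    """Dimension over F_p of the space spanned by the cyclic shifts of w."""
    return rank_mod_p(cyclic_shifts(w), p)


print(shift_span_dimension([1, 0, 1, 0, 1, 0], 5))  # composite length 6
print(shift_span_dimension([1, 1, 1, 0, 0], 2))     # q = 5, coordinate sum is 1 in F_2
print(shift_span_dimension([1, 1, 0, 0, 0], 2))     # q = 5, coordinate sum is 0 in F_2
```

In the last case every shift lies in the hyperplane of vectors whose coordinates sum to zero, so the span cannot be the whole space; this is why the theorem excludes vectors with zero coordinate sum.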

Let $V_i$ be a vector space of dimension $i$ contained in $\mathbb{F}_p^q$. We have $\frac{1}{q}\binom{q}{s}(p-1)^s$ orbits of size $q$ under the action of $\mathbb{Z}/q\mathbb{Z}$ on the $s$-sparse vectors. If $s < q$ then all the coordinates cannot be identical. Once the $s$ non-zero positions for an $s$-sparse vector are chosen there are $\ge (p-1)^s - (p-1)^{s-1}$ vectors whose coordinates do not sum to zero (see footnote 4). Now each of these orbits generates a vector space of rank $q$ by the above theorem. In each orbit there are at most $i$ vectors that can belong to $V_i$. Consequently, there are at least 

$$\frac{1}{q}\binom{q}{s}\bigl((p-1)^s - (p-1)^{s-1}\bigr)(q - i) \qquad (2)$$

$s$-sparse vectors that do not belong to $V_i$. We have thus proved the following theorem: 

Theorem 4. Let $q, p$ be prime numbers such that $p$ is congruent to a primitive root modulo $q$. Then 

$$\#\mathrm{GL}^s_{q \times r}(\mathbb{F}_p) \ge \prod_{0 \le i \le r-1} \frac{1}{q}\binom{q}{s}\bigl((p-1)^s - (p-1)^{s-1}\bigr)(q - i).$$



4 Indeed, it is not hard to show that the exact number of such vectors is $\frac{(p-1)\bigl((p-1)^s - (-1)^s\bigr)}{p}$. 



We note that the bound obtained above is almost tight 5 in the case s = 1, where the 1-sparse matrices are 
simply diagonal matrices (with non-zero entries) multiplied by permutation matrices. 

3.2 The Algorithm 

The outline of the algorithm is as follows. To create the Bloomier filter given $f : S \to \{0,1\}^k$, we consider 
each element x of S in turn. We generate a random equation as in (1) for x and check that the list of 
equations that we have so far has full rank. If not, we generate another equation using a different set of 2s 
hash functions. At any time, we keep the hash functions that have been used so far in blocks of 2s hash 
functions. When generating a new equation we always start with the first block of hash functions and try 
subsequent blocks only if the previous blocks failed to give a full rank system of equations. The results of 
the previous section show that the expected number of blocks of hash functions is bounded (provided the 
vector space has high dimension). Once we have a full rank set of equations for all the elements of S, we 
then proceed to solve the sparse set of equations. The solution to the equations is then stored in a table. 
At lookup time, we generate the equations using each block of hash functions in turn and output the first 
value generated by an equation that lies in the range of $f$. At first glance it looks like this approach stores $f$ with 
two-sided error, i.e., even when given $x \in S$ we might output a wrong value for $f(x)$. However, we show that 
the probability of error committed by the procedure on elements of $S$ can be made so small that, by doing 
a small number of trials, we can ensure that we do not err on any element of $S$. 



Algorithm 3 Setup parameters 

Input: $n > 0$ integer given in unary, $m > 0$ integer, $\epsilon > 0$. 
Output: $q$ and $p$ primes, where $p$ is an $m$-bit prime that is a primitive root modulo $q$. 
Let $q$ be the first prime $> n(1+\epsilon)$. 
Factor $q - 1$ and let $q_1, \ldots, q_k$ be the (distinct) prime factors of $q - 1$. 
repeat 
  Choose a random $g \in \mathbb{F}_q^{*}$. 
until $g^{(q-1)/q_i} \not\equiv 1 \bmod q$ for each $1 \le i \le k$. 
Let $g_i = g^i \bmod q$ for $1 \le i < q - 1$, $\gcd(i, q-1) = 1$. 
repeat 
  Choose a random $m$-bit integer $p$. 
until $p \equiv g_i \bmod q$ for some $i$, and $p$ is prime. 



Analysis of Algorithm 3: The first step of the algorithm finds the smallest prime larger than $n(1+\epsilon)$. The prime number theorem implies that the average gap between primes near $n$ is $\ln n$; this means that on average $q < n(1+\epsilon) + \ln n$. Let $\pi(x)$ be the prime counting function. Then from the prime number theorem $\pi(x + \delta x) - \pi(x) \sim \frac{\delta x}{\ln x}$ for any fixed $\delta > 0$ (much stronger results are known, see [1]). Thus, if $n$ is large enough, $q < (1+\epsilon')n$ for any $\epsilon' > \epsilon$. Since $n$ is provided in unary, the algorithm can factor $q - 1$ in linear time to obtain the prime factors $q_1, \ldots, q_k$. The loop following this step finds a primitive root modulo $q$. Since $\mathbb{F}_q^{*}$ is cyclic of order $q - 1$, standard group theory tells us that there are $\varphi(q-1)$ such generators, where $\varphi$ is Euler's totient function. It is known that $\varphi(n) \gg \frac{n}{\log\log n}$ (see [24]) and so the expected number of times the loop runs is $O(\log\log q)$. Once a generator $g$ is found, the algorithm computes the list $g_i$ of all generators of the group $\mathbb{F}_q^{*}$. This step requires $O(n \log^2 n)$ bit operations as we do $O(n)$ arithmetic operations over the field $\mathbb{F}_q$. The final loop of the algorithm attempts to find a random prime $p \equiv g_i \bmod q$. By the prime number theorem for arithmetic progressions the number of primes that are $\equiv g_i \bmod q$, for some $i$, below a bound $x$ is asymptotic to $\frac{\varphi(q-1)}{q-1} \cdot \frac{x}{\ln x}$. Thus a random number in the interval $[1, \ldots, x]$ satisfies the termination condition for the loop with probability about $\frac{\varphi(q-1)}{(q-1)\ln x}$. However, we seek a prime of $m$ bits and so we pick numbers at random in 



5 The bound is tight if we use the exact formula for the number of $s$-sparse vectors that do not sum to 0 in the 
derivation. 



the interval $[2^{m-1}, \ldots, 2^m - 1]$. Again, by the prime number theorem for arithmetic progressions, the number of primes in this interval that satisfy the condition is $\sim \frac{\varphi(q-1)}{q-1} \cdot \frac{x}{2(\ln x - \ln 2)}$, where $x = 2^m - 1$. This tells us that the expected number of iterations of the loop is about $\frac{q-1}{\varphi(q-1)}(m-1)\ln 2 = O(m \log\log q)$. We can use a probabilistic primality test to check for primality of the random $m$-bit numbers that we generate. If we use the Miller-Rabin primality test (from [22]) the expected number of bit operations (see footnote 6) is $\tilde O(m^4)$. In summary, the expected running time of Algorithm 3 is $\tilde O(n + m^4)$. Note that $m$ will be very small in practice (a prime of about 64 bits should suffice). 



Algorithm 4 Create Table 

Input: A set $S \subseteq D$ and a function $f : S \to \{0,1\}^k$, two primes $p, q$, $\mathcal{H}$ a hash family, and $s \ge 2$. 
Output: Table $g$, $h_0$ a hash function and $r$ blocks of $2s$ hash functions $B_i$. 
$M \leftarrow (0)_{n \times q}$ (an $n \times q$ zero matrix). 
Let $h_0$ be a random hash function from $\mathcal{H}$. 
$i \leftarrow 0$. 
for all $x \in S$ do 
  $i \leftarrow i + 1$; $j \leftarrow 0$. 
  repeat 
    if $B_j$ is not defined then 
      Generate $h_1, \ldots, h_{2s}$ random hash functions from $\mathcal{H}$. 
      $B_j \leftarrow \{h_1, \ldots, h_{2s}\}$. 
    end if 
    Let $h'_1, \ldots, h'_{2s}$ be the hash functions in $B_j$. 
    $M[i, h'_{k+s}(x)] \leftarrow h'_k(x)$ for $1 \le k \le s$; $j \leftarrow j + 1$. 
  until $\mathrm{Rank}(M) = i$ 
end for 
Let $v = (f(x) - h_0(x) : x \in S)^t$. 
Solve the system $M \times g = v$ for $g = (g[i] : 1 \le i \le q)^t$ over $\mathbb{F}_p$. 
Return $g$, $h_0$ and $B_i$. 

Analysis of Algorithm 4: The algorithm essentially mimics the proof of Theorem 4. It starts with a rank $i$ matrix and grows the matrix to a rank $i+1$ matrix by adding an $s$-sparse row using hash functions in $B_j$ (see footnote 7). Let $n = |S|$ and suppose $q > n(1+\epsilon)$ for a fixed $\epsilon > 0$. Then equation (2) tells us that in $O(1/\epsilon)$ iterations we will find that the rank of the matrix increases. In more detail, dividing the count in (2) by the total number $\binom{q}{s}(p-1)^s$ of $s$-sparse vectors shows that the probability that a random $s$-sparse vector does not lie in $V_i$ is at least $\bigl(1 - \frac{1}{p-1}\bigr)\frac{q-i}{q} = \Omega(\epsilon)$, since $i < n$ and $q > n(1+\epsilon)$. Note that this requires rather strong pseudorandom properties from the hash family $\mathcal{H}$. As mentioned in the discussion following Lemma 4.2 in [7], a family of cryptographically strong hash functions is needed to ensure that the vectors generated by the hash function from the input behave as random and independent sparse vectors over the finite field. We will make this assumption on the hash family $\mathcal{H}$. Checking the rank can be done by Gaussian elimination, keeping the resulting matrix at each stage. The inner loop thus runs in expected $O(n^2)$ time and the "for" loop takes $O(n^3)$ time on average. Solving the resulting set of sparse equations can be done in $O(n^2)$ time since the Gaussian elimination has already been completed. The algorithm also generates $r$ blocks of hash functions, and by the earlier analysis the expected value of $r$ is $O(1/\epsilon)$. In summary, the expected running time of Algorithm 4 is $O(n^3)$. We refer the reader to the appendix for a discussion on why sparse matrix algorithms cannot be used in this stage, and also why $s = 1$ cannot be used here. 
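
The rank check "keeping the resulting matrix at each stage" can be sketched as an incrementally maintained reduced basis. In the illustration below the class name `IncrementalRank` and the helper `random_sparse_row` are hypothetical, and random $s$-sparse rows stand in for the rows that Algorithm 4 derives from hash functions; `pow(x, -1, p)` assumes Python 3.8 or later.

```python
import random


class IncrementalRank:
    """Keeps reduced rows over F_p so each candidate row is tested against the basis."""

    def __init__(self, width, p):
        self.width, self.p = width, p
        self.pivots = {}   # pivot column -> basis row with a 1 in that column

    def try_add(self, row):
        """Add `row` if it increases the rank; return whether it did."""
        p = self.p
        row = [a % p for a in row]
        # Eliminate in insertion order: later basis rows are zero in earlier pivot columns.
        for col, basis_row in self.pivots.items():
            if row[col]:
                factor = row[col]
                row = [(a - factor * b) % p for a, b in zip(row, basis_row)]
        lead = next((c for c, a in enumerate(row) if a), None)
        if lead is None:
            return False                     # row lies in the span of the basis
        inv = pow(row[lead], -1, p)
        self.pivots[lead] = [(a * inv) % p for a in row]
        return True


def random_sparse_row(width, s, p, rng):
    """A row with exactly s non-zero entries (stand-in for a hash-generated equation)."""
    row = [0] * width
    for col in rng.sample(range(width), s):
        row[col] = rng.randrange(1, p)
    return row


rng = random.Random(0)
n, q, p, s = 50, 59, 101, 3
basis = IncrementalRank(q, p)
attempts = 0
for _ in range(n):
    while True:                              # retry until the rank grows, as in Algorithm 4
        attempts += 1
        if basis.try_add(random_sparse_row(q, s, p, rng)):
            break
print(len(basis.pivots), attempts)
```

Each call to `try_add` costs time proportional to the current rank times the row width, which corresponds to the quadratic cost per element assumed in the analysis above.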

Analysis of Algorithm 5: In this algorithm we try the blocks of hash functions and output the first 
"plausible" value of the function (namely, a value in the range of the function /). If the wrong block, Bi, 
of hash functions was used then the probability that the resulting function value, y, belongs to the range 



6 The soft-Oh notation, $\tilde O$, hides factors of the form $\log\log n$ and $\log m$. 

7 Strictly speaking the row could have fewer than $s$ non-zero entries because a hash function could map to zero. But this 
happens with low probability. 



$\{0,1\}^k$ is $\frac{2^k}{p}$. If the right block $B_i$ was used then, of course, we get the correct value of the function and $y = f(x)$. If $x \in D \setminus S$, then again the probability that $y \in \{0,1\}^k$ is at most $\frac{2^k}{p}$ for each block tried. Since $r$ and $s$ are $O(1)$ the algorithm requires $O(1)$ operations over the finite field $\mathbb{F}_p$. This requires $O(\log^2 p)$ bit operations with the usual algorithms for finite field operations, and only $O(\log p \log\log p)$ bit operations if FFT multiplication is used. 



Algorithm 5 Query function 

Input: Table $g$, hash functions $h_0$, $B_i$ for $1 \le i \le r$, and $x \in D$. 
Output: $y \in \{0,1\}^k$ or $\perp$. 
$i \leftarrow 1$ 
while $i \le r$ do 
  Let $h_1, \ldots, h_{2s}$ be the hash functions in $B_i$. 
  Let $y \leftarrow h_0(x) + \sum_{1 \le j \le s} h_j(x)\, g[h_{j+s}(x)]$. 
  if $y \in \{0,1\}^k$ then 
    Return $y$. 
  end if 
  $i \leftarrow i + 1$ 
end while 
Return $\perp$. 



How to get one-sided error: The analysis in the previous paragraph shows that the probability that we err on any element of $S$ is at most $\frac{n r 2^k}{p}$, by a union bound over the $n$ elements and the at most $r$ blocks tried for each. Thus, if $p$ is large we can construct a $g$ table using Algorithm 4 and verify whether we give the correct value of $f$ for all elements of $S$. If not, we can use Algorithm 4 again to construct another table $g$. The probability we succeed at any stage is at least $1 - \frac{n r 2^k}{p}$, and if $p$ is taken large enough that this is $\ge \frac{1}{2}$, then the expected number of iterations is $\le 2$. We summarize the properties of the 
Bloomier filter constructed in this section below: 



Theorem 5. Fix $\epsilon > 0$ and $s \ge 2$ an integer, let $S \subseteq D$, $|S| = n$ and let $m, k$ be positive integers such that 
$m > k$. Given $f : S \to \{0,1\}^k$, the Bloomier filter constructed (with parameters $\epsilon$, $m$ and $s$) by Algorithms 
3 and 4, and queried using Algorithm 5, has the following properties: 

1. The expected time to create the Bloomier filter is $O(n^3 + m^4)$. 

2. The space utilized is $\lceil n(1+\epsilon) \rceil m$ bits. 

3. Computing the value of the Bloomier filter at $x \in D$ requires $O(1)$ hash function evaluations and $O(1)$ 
memory lookups. 

4. If $x \in S$, it outputs the correct value of $f(x)$. 

5. If $x \notin S$, it outputs $\perp$ with probability $1 - O\bigl(\frac{1}{\epsilon} 2^{k-m}\bigr)$. 

4 Bucketing 

The construction in §3 is space-efficient but the time to construct the Bloomier filter is exorbitant. In this section we show how to mitigate this with bucketing. To build a Bloomier filter for $f : S \to \{0,1\}^k$, one can choose a hash function $g : S \to \{0, 1, \ldots, b-1\}$ and then build Bloomier filters for the functions $f_i : S_i \to \{0,1\}^k$, for $i = 0, 1, \ldots, b-1$, where $S_i = g^{-1}(i)$ and $f_i(x) = f(x)$ for $x \in S_i$. The sets $S_i$ have an expected size of $|S|/b$, which results in a speedup for the construction time. The bucketing also allows one to parallelize the construction process, since each of the buckets can be processed independently. To quantify the time saved by bucketing we need a concentration result for the size of the buckets produced by the hash function. 



Fix a bucket $b_i$, $0 \le b_i < b$, and define random variables $X_{s_1}, \ldots, X_{s_n}$ for $s_j \in S$ as follows: pick a hash function $g : S \to \{0, 1, \ldots, b-1\}$ from a family of hash functions $\mathcal{H}$ and set $X_{s_j} = 1$ if $g(s_j) = b_i$ and set $X_{s_j} = 0$ otherwise. Under the assumption that the random variables $X_{s_j}$ are mutually independent, we obtain using Chernoff bounds that $\Pr\bigl[\sum_j X_{s_j} > (1+\delta)\frac{|S|}{b}\bigr] < 2^{-\delta |S|/b}$ provided $\delta > 2e - 1$. This bound holds for any bucket and consequently, $\Pr\bigl[\exists\, i : \sum_j X^{(i)}_{s_j} > (1+\delta)\frac{|S|}{b}\bigr] < b\, 2^{-\delta |S|/b}$, where $X^{(i)}_{s_j}$ denotes the indicator variable for bucket $i$. Thus with probability $> 1 - b\, 2^{-\delta |S|/b}$ all the buckets have at most $(1+\delta)\frac{|S|}{b}$ elements. Suppose we take the number of buckets $b$ to be $\frac{|S|}{c \log |S|}$ for $c > 1$. Then the probability that all the buckets are of size $\le c(1+\delta)\log|S|$ is at least $1 - 2^{(-c\delta\log|S| + \log|S| - \log(c \log|S|))}$, which for large enough $|S|$ is $> 1/2$. In other words, the expected number of 
trials until we find a hash function $g$ that results in all the buckets being "small" is less than 2. 
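
The retry loop over bucketing hash functions can be sketched as follows. The salted BLAKE2 hash is an assumed stand-in for the family $\mathcal{H}$, logarithms are taken to base 2 (an assumption, suggested by the powers of 2 in the bound), and the function names and default values of `c` and `delta` are chosen for illustration with $\delta > 2e - 1$.

```python
import hashlib
import math


def bucket_of(x, seed, b):
    """Bucket index in {0, ..., b-1} from a salted hash (illustrative stand-in for H)."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(x.encode(), salt=salt, digest_size=8).digest()
    return int.from_bytes(digest, "big") % b


def bucketize(keys, c=2.0, delta=5.0, max_tries=100):
    """Try hash seeds until every bucket holds at most c(1+delta) log2|S| keys."""
    n = len(keys)
    b = max(1, int(n / (c * math.log2(n))))          # about |S| / (c log|S|) buckets
    limit = c * (1 + delta) * math.log2(n)
    for seed in range(max_tries):
        buckets = [[] for _ in range(b)]
        for x in keys:
            buckets[bucket_of(x, seed, b)].append(x)
        if max(len(bucket) for bucket in buckets) <= limit:
            return seed, buckets
    raise RuntimeError("no suitable bucketing hash found")


keys = [f"key{i}" for i in range(10000)]
seed, buckets = bucketize(keys)
print(seed, len(buckets), max(len(bucket) for bucket in buckets))
```

Each bucket can then be handed to the construction of §3 independently, for example on separate workers.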

Remark 1. Note that the assumption that the random variables $X_{s_j}$ be mutually independent requires the 
hash family to have strong pseudorandom properties. For instance, if $\mathcal{H}$ is a 2-universal family of hash 
functions then the random variables $X_{s_j}$ are only pairwise independent. 

In the following discussion we adopt the notation from Theorem 5. We assume that we have a hash function $g$ that results in all buckets having $O(\log n)$ elements. The time for creation of the Bloomier filter in §3 for each bucket is reduced to $O(\log^3 n + m^4)$. To query the bucketed Bloomier filter, given $x$, we first compute the bucket, $g(x)$, and then query the Bloomier filter for that bucket. Thus, querying requires one more hash function evaluation than the non-bucketing version. Suppose $n_i$ is the number of elements of $S$ that belong to the bucket defined by $b_i$; then the Bloomier filter for this bucket requires $\lceil n_i(1+\epsilon) \rceil m$ bits. The total number of bits used is $\sum_{0 \le i < b} \lceil n_i(1+\epsilon) \rceil m \le \sum_{0 \le i < b} (n_i(1+\epsilon) + 1)m = n(1+\epsilon)m + bm$, since $\sum_i n_i = n$. Since the number of buckets is $O(n/\log n)$, the number of bits used is $n(1+\epsilon)m + O(mn/\log n)$. 

We summarize the properties of the bucketing variant of the construction in §3 in the following theorem. 

Theorem 6. Fix $\epsilon > 0$ and $s \ge 2$ an integer, let $S \subseteq D$, $|S| = n$ and let $m, k$ be positive integers such 
that $m > k$. Given $f : S \to \{0,1\}^k$, bucketed using $|S|/(c \log|S|)$ buckets for a fixed $c > 1$, the Bloomier 
filter constructed on the buckets (with parameters $\epsilon$, $m$ and $s$) by Algorithms 3 and 4, and queried (on the 
buckets) using Algorithm 5, has the following properties: 

1. The expected time to create the Bloomier filter is $O\bigl(\frac{n}{\log n}(\log^3 n + m^4)\bigr)$. 

2. The space utilized is $n(1+\epsilon)m + O(mn/\log n)$ bits. 

3. Computing the value of the Bloomier filter at $x \in D$ requires $O(1)$ hash function evaluations and $O(1)$ 
memory lookups. 

4. If $x \in S$, it outputs the correct value of $f(x)$. 

5. If $x \notin S$, it outputs $\perp$ with probability $1 - O\bigl(\frac{1}{\epsilon} 2^{k-m}\bigr)$. 



5 A remark on the use of sparse matrices 

There are more efficient algorithms for computing the rank of a sparse matrix and for inverting such matrices 
(see [25, 17, 18]). However, these algorithms do not lend themselves to the following incremental operation: Fix $s$; given 
an $s$-sparse $r \times n$ matrix $M$ whose rank has already been computed by any of these algorithms, compute the 
rank of an $s$-sparse $(r+1) \times n$ matrix that contains $M$ in the first $r$ rows. To reduce the running time of 
Algorithm 4 to $O(n^2)$, one would need to solve the above problem in $O(n)$ time. Checking that an $s$-sparse 
$n \times n$ matrix over $\mathbb{F}_p$ is full rank can be done in $O(n^2)$ time. However, since the $s$-sparse invertible matrices 
are relatively rare (consider the case when $s$ is small, for instance, 1) we cannot simply pick a random $s$-sparse 
matrix and have a reasonable probability of it being invertible. This is why we have to build the matrix in 
stages as we do in the above algorithm. 

One may wonder if the construction works for $s = 1$, because in this case the matrix inversion is easy. 
However, the analysis of Algorithm 4 fails in this case. In more detail, if $s = 1$ then each block of hash 
functions contains two hash functions. Out of these, only one determines the "variable" $g[i]$ to be 
used for an element. The construction then attempts to implicitly create a matching between the $n$ elements 
of $S$ and these $g[i]$, of which there are $n(1+\epsilon)$, using $r$ hash functions. If the number of hash function blocks 
$r$ were $O(1)$, then in particular there is a hash function $h_j$ that creates a matching for $n/r$ elements. This 
implies that a hash function is able to create an injection of a set $S'$ with $n/r$ elements into a set of size 
$n(1+\epsilon)$. However, the probability that this event occurs is exponentially low if $r = o(\sqrt{n})$, since by the 
Birthday paradox the expected number of colliding pairs in such a construction is $\binom{n/r}{2} / (n(1+\epsilon))$. 
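
To get a feel for the Birthday paradox argument, the sketch below computes the expected number of colliding pairs and the exact probability that a uniformly random map of $n/r$ items into $n(1+\epsilon)$ slots is injective. The values of `n`, `eps` and `r`, and the idealized assumption of fully random hashing, are illustrative choices rather than anything fixed by the paper.

```python
from math import comb

def expected_colliding_pairs(num_items, num_slots):
    """Expected number of colliding pairs when num_items are hashed
    uniformly and independently into num_slots."""
    return comb(num_items, 2) / num_slots

def injection_probability(num_items, num_slots):
    """Probability that a uniformly random map is injective."""
    prob = 1.0
    for i in range(num_items):
        prob *= (num_slots - i) / num_slots
    return prob

n, eps = 10_000, 0.1            # illustrative values only
slots = round(n * (1 + eps))
for r in (2, 4, 8):             # a constant number of hash function blocks
    items = n // r
    print(r, expected_colliding_pairs(items, slots),
          injection_probability(items, slots))
```

For constant $r$ the expected number of colliding pairs grows linearly in $n$, so the chance of an injection collapses quickly as $n$ increases.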

6 Comparison with earlier work 

Our constructions in §2 and §3 and those of [7] are related in the broad sense that all approaches use hash 
functions to generate equations that can be solved to construct a Bloomier filter. However, the details differ 
markedly. Let $S \subseteq D$ and $f : S \to \{0,1\}^k$ be the function that we wish to store. The approach of [7] is 
as follows: They first show that if we have $r$ hash functions (we need $r > 1$; see the discussion below), then 
for each element $x \in S$ we can single out a hash value $h_{i(x)}(x)$ which does not collide with the chosen 
hash values for the other elements. They prepare a table that stores the mapping from $x$ to the chosen hash 
function ($x \mapsto i(x)$) efficiently, and then look up a second table using the hash function indicated by the 
first table. A bit-mask computed using another hash function is used to provide error resiliency. Finding the 
"chosen" hash value for each element requires a matching problem to be solved. For the matching problem 
to have a solution we need at least two hash functions. Indeed, if we had only one hash function then the 
Birthday paradox implies that there will be collisions among the elements (since the hash functions map the 
set $S$ of size $n$ to a set of size $\approx n$). For the colliding elements we cannot select the "chosen" hash value. 
Provided $r > 1$, they show that the matching problem can be solved in $O(n \log n)$ time on average. The 
space used is $rc(q+k)n$ bits, where $r \ge 2$, c > H — j= and $q = \log\frac{1}{\epsilon}$ (here $\epsilon$ is the probability of an error 
given $x \in D \setminus S$). Look up requires $r+1$ hash function evaluations and $r+1$ memory accesses ($r$ accesses to 
Table$_1$ and one access to Table$_2$ in the notation of [7]). Since $r \ge 2$, we need at least 3 memory accesses. 
More importantly, their construction allows changes to the function value in the same time as a look up. 

The method from §2 can be constructed in linear time on average, which is faster than [7]. The space 
utilization is similar, $\approx (2+\delta)n(\log\frac{1}{\epsilon} + k)$ (for any $\delta > 0$). Changing the value of $f(x)$ for $x \in S$ is slower, 
taking $O(\log n)$ time. Looking up a function value requires 3 hash evaluations and 2 memory accesses (which 
is slightly faster than their scheme). The method from §3 is more efficient in storage space than both 
methods. It requires only $(1+\delta)n(\log\frac{1}{\epsilon} + k)$ space for any fixed $\delta > 0$. Look up time is still $O(1)$, but 
creation time is an exorbitant $O(n^3)$. The bucketing approach from §4 reduces this to nlog ^ ' n. We believe 
it would be an interesting problem to construct Bloomier filters that require $cn(\log\frac{1}{\epsilon} + k)$ bits of storage for 
$c < 2$, while allowing a look up time of $O(1)$ and a creation time of $O(n)$. 

Note: Some recent independent results of [If] are much closer to our approach. They use results of [6] to get 
a bound on the number of sparse invertible matrices over the field $\mathbb{F}_2$ and, consequently, encode the function. 

7 Experimental Results 

In this section we discuss the results of some experiments that we ran comparing our construction of Bloomier 
filters (from §2) and the scheme of [7]. The function we store is InDeg, which maps a URL to the number of 
URLs that link to it. We obtained the in-link information for a little over a billion URLs from the Live Search 
crawl data. We measured the creation time and memory usage for both schemes for various numbers of 
URLs and averaged the results over 45 trials. The results are graphed in Figure 1. Figure 1a shows the time 
taken by both methods for creation of the Bloomier filter (we used $c = 2.5$ and error probability $2^{-32}$ 
for both schemes). Figure 1b displays the number of trials needed by the creation phase of each algorithm to 
find an appropriate graph (acyclic in our algorithm and lossless expander in theirs). As one can see from the 
results, the number of trials until a lossless expander is found is about the same as that of finding an acyclic 
graph. However, it takes comparatively longer to find a matching in the graph. Our scheme ends up being 
between 3 and 5 times faster for creation of the filter as a result (see Figure 1c). Also, the memory footprint of the 
constructed Bloomier filter in both schemes is similar, allowing the in-link information for $\approx 1.1 \times 10^9$ URLs 
to fit in $\approx 20$ GB. 
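
As a rough plausibility check of the reported memory footprint, the sketch below evaluates the space estimate $cn(\log\frac{1}{\epsilon} + k)$ from §6 with the stated $c = 2.5$, error probability $2^{-32}$ and $1.1 \times 10^9$ URLs. The value width $k = 32$ bits is an assumption made here for illustration; the text does not state how many bits were used to store InDeg.

```python
def bloomier_bits(n, c, error_bits, value_bits):
    """Bits used by a table of c*n entries, each holding error_bits + value_bits."""
    return c * n * (error_bits + value_bits)

n = 1.1e9          # number of URLs, from the text
c = 2.5            # from the text
error_bits = 32    # log2(1 / 2**-32), from the text
value_bits = 32    # ASSUMPTION: in-degree stored as a 32-bit value

bits = bloomier_bits(n, c, error_bits, value_bits)
gib = bits / 8 / 2**30
print(f"{gib:.1f} GiB")
```

Under that assumption the estimate lands close to the 20 GB figure quoted above; a different value width would shift it proportionally.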

A Circulant matrices over finite fields 

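Definition 1. Let $V$ be an $n$-dimensional vector space over a field $F$ and let $w = (w_0, w_1, \ldots, w_{n-1}) \in V$. 
A circulant matrix associated to $w$ is the $n \times n$ matrix 

$$\begin{pmatrix}
w_0 & w_1 & \cdots & w_{n-2} & w_{n-1} \\
w_{n-1} & w_0 & \cdots & w_{n-3} & w_{n-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
w_2 & w_3 & \cdots & w_0 & w_1 \\
w_1 & w_2 & \cdots & w_{n-1} & w_0
\end{pmatrix}$$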

The following results are from [15]; however, the proofs need some modification since we are dealing with 
vector spaces over finite fields. 

First we need a closed form for the determinant of a circulant matrix. 

Theorem 7. Let $p$ be a prime and $n$ a positive integer relatively prime to $p$. Let $w = (w_0, w_1, \ldots, w_{n-1})$ 
be a vector in $\mathbb{F}_p^n$ and let $W$ be the circulant matrix associated to $w$. Then 

$$\det W = \prod_{0 \le \ell \le n-1} \Bigl( \sum_{0 \le j \le n-1} \epsilon^{j\ell} w_j \Bigr),$$

where $\epsilon$ is a primitive $n$-th root of unity contained in the algebraic closure $\overline{\mathbb{F}}_p$. 

Proof. One can view $W$ as a linear transformation acting on the vector space $\overline{\mathbb{F}}_p^{\,n}$. An explicit calculation 
shows us that the vectors $x_\ell = (1, \epsilon^{\ell}, \epsilon^{2\ell}, \ldots, \epsilon^{(n-1)\ell})$, $0 \le \ell \le n-1$, are all eigenvectors with eigenvalues 

$$\lambda_\ell = w_0 + \epsilon^{\ell} w_1 + \cdots + \epsilon^{(n-1)\ell} w_{n-1}.$$

Since $\{x_0, \ldots, x_{n-1}\}$ is a linearly independent set, we conclude that 

$$\det W = \prod_{0 \le \ell \le n-1} \lambda_\ell.$$

It is not immediately apparent that the product is actually in $\mathbb{F}_p$. To show that, one looks at the smallest 
field $\mathbb{F}_{p^r}$ that contains $\epsilon$. The Galois group of this field, $\mathrm{Gal}(\mathbb{F}_{p^r}/\mathbb{F}_p)$, is cyclic of order $r$; indeed, $r$ is the 
smallest integer such that $p^r \equiv 1 \bmod n$. The Galois group is generated by the Frobenius map, $\mathrm{Fr}$, that 
sends $x \mapsto x^p$. Under this map, $\epsilon \mapsto \epsilon^p = \epsilon^s$ where $s = p \bmod n$. Consequently, $\mathrm{Fr}(\lambda_\ell) = \lambda_{\ell'}$ where $\ell' = p\ell 
\bmod n$. Since $\gcd(n, p) = 1$ the Frobenius just permutes the terms of the product $\prod_{0 \le \ell \le n-1} \lambda_\ell$. The product 
$\prod_{0 \le \ell \le n-1} \lambda_\ell = \det W$ is fixed by the Galois group and so it belongs to $\mathbb{F}_p$. □ 
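
The determinant formula can be checked directly in a small case where the root of unity already lies in $\mathbb{F}_p$. The sketch below is illustrative only: the choice $p = 7$, $n = 3$ and the root $2$ (which has multiplicative order 3 modulo 7) are assumptions picked so that no field extension is needed, and the helper names are hypothetical.

```python
from itertools import product

def circulant(w):
    """Row i is w shifted right by i positions: entry (i, j) = w[(j - i) mod n]."""
    n = len(w)
    return [[w[(j - i) % n] for j in range(n)] for i in range(n)]

def det_mod_p(matrix, p):
    """Determinant over F_p by Gaussian elimination."""
    m = [[x % p for x in row] for row in matrix]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                      # a row swap flips the sign
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)       # modular inverse (Python 3.8+)
        for r in range(col + 1, n):
            factor = m[r][col] * inv % p
            for c in range(col, n):
                m[r][c] = (m[r][c] - factor * m[col][c]) % p
    return det % p

def eigenvalue_product(w, root, p):
    """Product over l of sum_j root^(j*l) * w_j, computed in F_p."""
    n = len(w)
    result = 1
    for l in range(n):
        lam = sum(pow(root, j * l, p) * w[j] for j in range(n)) % p
        result = result * lam % p
    return result

p, n, root = 7, 3, 2   # 2 is a primitive cube root of unity modulo 7
all_match = all(
    det_mod_p(circulant(w), p) == eigenvalue_product(w, root, p)
    for w in product(range(p), repeat=n)
)
print(all_match)
```

The loop covers all $7^3$ vectors, so agreement here is an exhaustive check for this one parameter choice, not a substitute for the proof.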

Theorem 8. Let $p, q$ be primes such that the $q$-th cyclotomic polynomial $f(x) = \sum_{0 \le i < q} x^i$ is irreducible modulo 
$p$. Suppose $W$ is a circulant matrix associated to $w = (w_0, w_1, \ldots, w_{q-1})$ with entries in the finite field $\mathbb{F}_p$; 
then 

$$\det W = \det \begin{pmatrix}
w_0 & w_1 & \cdots & w_{q-2} & w_{q-1} \\
w_{q-1} & w_0 & \cdots & w_{q-3} & w_{q-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
w_2 & w_3 & \cdots & w_0 & w_1 \\
w_1 & w_2 & \cdots & w_{q-1} & w_0
\end{pmatrix} = 0$$

if and only if either $\sum_{0 \le i \le q-1} w_i = 0$ or all the $w_i$ are equal. 

Proof. We will make use of the notation introduced in the previous theorem here. If $\sum_i w_i = 0$ then $\lambda_0 = 0$ 
and so $\det W = 0$. Suppose all the $w_i$ are equal; then for $\ell > 0$ 

$$\lambda_\ell = w_0 \bigl(1 + \epsilon^{\ell} + \cdots + \epsilon^{(q-1)\ell}\bigr) = w_0 \, \frac{\epsilon^{q\ell} - 1}{\epsilon^{\ell} - 1} = 0, \quad \text{since } \epsilon^q = 1.$$

Assume that $\det W = 0$ and that $\lambda_0 \ne 0$. Then by the formula for the determinant $\lambda_\ell = 0$ for some $0 < \ell < q$. 
By the formula for $\lambda_\ell$, $\epsilon^{\ell}$ is a root of the polynomial 

$$p(x) = \sum_{0 \le i \le q-1} w_i x^i.$$

However, since $q$ is prime, $\epsilon^{\ell}$ is also a primitive $q$-th root of unity, and the minimal polynomial for a primitive 
$q$-th root of unity over the rationals is the $q$-th cyclotomic polynomial $f(x) = \sum_{0 \le i \le q-1} x^i$. This is an irreducible polynomial 
modulo $p$ (by our assumption) and hence is also the minimal polynomial for $\epsilon^{\ell}$ over $\mathbb{F}_p$. Thus $p(x)$ is a constant 
multiple of $f(x)$ and thus all $w_i$ are equal. □ 
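
For small parameters the characterization of singular circulant matrices can be verified exhaustively. The following sketch assumes $p = 3$ and $q = 5$ (chosen because 3 generates the multiplicative group modulo 5, which is the irreducibility criterion given in the next theorem); the helper names are hypothetical and the brute-force search is only feasible for tiny cases.

```python
from itertools import product

def circulant(w):
    """Entry (i, j) of the circulant matrix is w[(j - i) mod q]."""
    q = len(w)
    return [[w[(j - i) % q] for j in range(q)] for i in range(q)]

def det_mod_p(matrix, p):
    """Determinant over F_p by Gaussian elimination."""
    m = [[x % p for x in row] for row in matrix]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col]), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)
        for r in range(col + 1, n):
            factor = m[r][col] * inv % p
            for c in range(col, n):
                m[r][c] = (m[r][c] - factor * m[col][c]) % p
    return det % p

def is_primitive_root(p, q):
    """True if p generates the multiplicative group modulo the prime q."""
    if p % q == 0:
        return False
    order, x = 1, p % q
    while x != 1:
        x = x * p % q
        order += 1
    return order == q - 1

p, q = 3, 5   # illustrative choice: 3 has order 4 modulo 5
assert is_primitive_root(p, q)

# Vectors where "det W = 0" disagrees with "sum is 0 or all entries equal".
violations = [
    w for w in product(range(p), repeat=q)
    if (det_mod_p(circulant(w), p) == 0)
    != (sum(w) % p == 0 or len(set(w)) == 1)
]
print(len(violations))
```

An empty `violations` list means the two conditions coincide on every vector in $\mathbb{F}_3^5$.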



Theorem 9. Let $q$ be a prime and $f(x) = \sum_{0 \le i < q} x^i$ be the $q$-th cyclotomic polynomial. Then $f(x)$ is 
irreducible modulo a prime $p$ iff $p \equiv g \bmod q$ where $g$ is a generator of the cyclic group $\mathbb{F}_q^*$. 

Proof. Let $K = \mathbb{F}_p$. Every root, $r$, of $f(x)$ over the algebraic closure $\overline{K}$ satisfies $r^q = 1$. In other words, 
they are elements of multiplicative order $q$ in $\overline{K}^*$. The smallest extension $\mathbb{F}_{p^s}$ that contains elements of 
multiplicative order $q$ is given by the smallest $s$ such that $p^s \equiv 1 \bmod q$. This is the order of $p$ in the multiplicative 
group $\mathbb{F}_q^*$. Suppose $p \equiv g \bmod q$; then the smallest such $s$ is $q - 1$. The field $\mathbb{F}_{p^{q-1}}$ is the splitting field 
of the polynomial $f(x)$ and consequently, $f(x)$ is irreducible over $\mathbb{F}_p$. Now suppose the order of $p$ modulo $q$ is 
$e < q - 1$. Then there is an extension of $\mathbb{F}_p$, say $\mathbb{F}_{p^e}$, that contains a root $\alpha$ of $f(x)$. Now the polynomial 
$f'(x) = \prod_{\sigma \in \mathrm{Gal}(\mathbb{F}_{p^e}/\mathbb{F}_p)} (x - \sigma(\alpha))$ is a factor of $f(x)$ which has coefficients in $\mathbb{F}_p$ and is of degree $e < q - 1$. 
Consequently, $f(x)$ is not irreducible over $\mathbb{F}_p$. □ 



A bit of algebraic number theory gives some more information: The polynomial $f(x)$ is irreducible modulo 
$p$ iff $p$ remains inert in the $q$-th cyclotomic field $\mathbb{Q}(\mu_q)$, where $\mu_q = \exp(2\pi i/q)$. This happens iff the Artin 
symbol at $p$ is a generator of the Galois group $\mathrm{Gal}(\mathbb{Q}(\mu_q)/\mathbb{Q})$. By the Chebotarev density theorem this happens 
for a constant density of primes; indeed, the density is $\frac{\varphi(q-1)}{q-1}$. 
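
The density statement can be illustrated empirically. Since the group $\mathbb{F}_q^*$ is cyclic of order $q - 1$, a proportion $\varphi(q-1)/(q-1)$ of its elements are generators, and the sketch below compares that proportion with the observed fraction of primes $p$ whose order modulo $q$ is $q - 1$. The choice $q = 7$, the bound of 100,000 on $p$ and the helper names are assumptions made for illustration.

```python
from math import gcd

def primes_up_to(limit):
    """Sieve of Eratosthenes."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i in range(limit + 1) if sieve[i]]

def multiplicative_order(a, q):
    """Order of a modulo the prime q (a must not be divisible by q)."""
    order, x = 1, a % q
    while x != 1:
        x = x * a % q
        order += 1
    return order

def euler_phi(m):
    """Euler's totient by direct counting (fine for small m)."""
    return sum(1 for t in range(1, m + 1) if gcd(t, m) == 1)

q = 7
primes = [p for p in primes_up_to(100_000) if p != q]
inert = sum(1 for p in primes if multiplicative_order(p, q) == q - 1)
observed = inert / len(primes)
predicted = euler_phi(q - 1) / (q - 1)
print(observed, predicted)
```

The agreement is only approximate for any finite range of primes, since the density is an asymptotic statement.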